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Abstract Multigrid algorithms are among the fastest iterative methods known 
today for solving large linear and some non-linear systems of equations. Greatly 
optimized for serial operation, they still have a great potential for parallelism 
not fully realized. In this work, we present a novel multigrid algorithm designed 
to work entirely inside many-core architectures like the graphics processing 
units (GPUs), without memory transfers between the GPU and the central 
processing unit (CPU), avoiding low bandwitdth communications. The algo- 
rithm is denoted as the high occupancy multigrid (HOMG) because it makes 
use of entire grid operations with interpolations and relaxations fused into one 
task, providing useful work for every thread in the grid. For a given accuracy, 
its number of operations scale linearly with the total number of nodes in the 
grid. Perfect scalability is observed for a large number of processors. 
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1 Introduction 

The multigrid algorithm [l]|2] is one of the fastest methods for solving lin- 
ear and non-linear systems of equations derived from a variety of problems, 
like numerical discretizations of partial differential equations and non-linear 
variational problems [3,4 . The main idea is to accelerate the convergence of 
an iterative method using a hierarchy of nested discretizations to perform 
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resolution-dependent corrections that are passed between levels using interpo- 
lations. Multigrid has the main advantage over other methods that when used 
to solve the problems with a given accuray, its number of operations often scale 
linearly with the number of discrete nodes used. It has been widely studied 
and optimized for sequential operation during the last decades, providing very 
fast convergent algorithms that mostly lack fine grain parallelism because of 
the sequential use of updates during the same iteration. 

In the last few years, hardware aspects have led to a paradigm change in the 
design of numerical methods. Performance improvements now come by paral- 
lelization and specialization and not any more by frequency scaling. Many-core 
parallel architectures, like GPUs, seem ideally suited for the acceleration of 
iterative methods like multigrid, given that its basic operations are essentially 
local and explicit. 

Early works on multigrid for GPUs date back to 2003 [5','6^, when GPUs 
started to outperform CPUs and control over the operations and memories 
of the GPU were available using programmable vertex shaders. These imple- 
mentations are many-core maps of a classic multigrid algorithm using recur- 
sive V-cycles. Additional interesting works in this direction are found in [7l[8l 
[QlfTO . and more recently using Nvidia's compute unified device architecture 
(CUDA) in [nirr2irr3lfT4lll5[ll6j . Nevertheless, a logical consequence of the use 
of a classic multigrid in many-core architectures is the appearance of perfor- 
mance penalties for coarse grids where the number of independent operations 
is reduced and memory latencies cannot be hided using multithreading. 

In order to avoid these penalties, a hybrid approach is preferred by some 
authors pTlfTSl fTO] where the CPU is used for serial matrix inversions over 
coarse grids and the GPU for computing fine-grid relaxations. This is a natural 
way to avoid the low parallelism of the coarse grids but with the penalty of the 
memory exchange between the random access memory (RAM) of the CPU and 
that of the GPU, which can be several orders of magnitude lower in bandwidth 
than the GPU fast off-chip memory. 

In this work, we modify the classic multigrid algorithm giving priority to 
keeping the data on the GPU, exploiting its fast memory levels and maintain- 
ing a constant occupancy. The result is a simple, fully parallel and perfectly 
scalable multigrid solver that works entirely inside the GPU without the need 
to communicate data to the CPU. We prove the perfect scalability of the algo- 
rithm using several GPUs with different numbers of processors. The extension 
to multiple GPUs and clusters is beyond the scope of this work. 



2 A high occupancy many-core multigrid (HOMG) 

Linear multigrid algorithms are iterative methods to solve large linear systems 
of the form 

Au = f , (1) 
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using a hierarchy of discretizations in nested grids where level-dependent re- 
strictions P, relaxations and interpolations S are performed in a variety of 
cycles. 

Classic V-cycles F (7^1,7^2) consist of the recursive application of the fol- 
lowing two- level correction scheme: (a) rji pre-smoothing relaxations Vh ^ 
Ft{'ni)^h] (b) the restriction of the residual Vh = fh — ^h^^h to the next coarser 
grid r2h ^ P^h] (c) the solution of the residual equation A2he-2h = T2h] (d) 
the interpolation of the error to the fine grid and the correction of the solu- 
tion Vh ^ Vh ^ Se2h] and (e) 7^2 post-smoothing relaxations Vh ^ R{r]2)vh- 
Usually, interpolations are linear, restrictions are linearly weighted averages 
of neighboring sources and relaxations are one of the variations of Jacobi or 
Gauss-Sidel methods [1 . 

A fundamental issue in many-core architectures like CPUs is the efficient 
use of its fast parallel fetches, optimized in every warp for contiguous memory 
locations and a large number of threads. Massive multithreading and the fast 
memory cache levels must be efficiently used to reduce cache misses and mem- 
ory latencies. With this in mind, it is logical that some problems appear in 
multigrid algorithms as the grid is coarsened because the number of threads, 
proportional to the number of points in the grid, will be quite low, sometimes 
even lower than the number of processors available, and the processors will 
have very small amounts of work or will idle. Even colored Gauss-Sidel meth- 
ods [1. could reduce the number of parallel threads and complicate the memory 
access patterns. The coarsening of the grid and the consequent reduction of 
workload, basic properties of multigrid, become less attractive under the optic 
of CPUs, where operations over the entire grid are optimal and inexpensive in 
one, two and three-dimensional structured grids. 

A multigrid algorithm that uses the entire grid for every iteration looks 
contradictory at first sight because it has more operations than the classic ver- 
sion, but it is attractive in many-core architectures because it keeps a constant 
optimal memory model fully exploiting its parallel fetching and processing ca- 
pabilities. One way to provide useful work for every thread during a coarse grid 
correction, is to fuse relaxations and interpolations in one task routine. Fur- 
thermore, we don't need to allocate different meshes because we could always 
act simultaneously in complementary subsets of the entire grid. One subset of 
points computes relaxations while the rest perform interpolations of the pre- 
vious iteration variations. This strategy provides a constant high occupancy 
of the cores. 

The damped Jacobi relaxation [1 is a natural choice for a fully parallel 
algorithm that only uses data produced in a previous iteration. This method 
is equivalent to the explicit integration of a discretized version of 

^ = «(£(«)-/), (2) 

where jC{u) is any elliptic partial differential operator. For example, if jC{u) = 
V^ii and the Equation (j2j) is discretized using centered spatial differences. 



4 




Fig. 1 Modified full multigrid cycle (MFMG). The number of iterations per level are dou- 
bling for every finer level until reaching a maximum number (MaxI). 

forward Euler integration in time and nested meshes doubling the spatial dis- 
cretization, the relaxation at level m is given by 

<r = <i +^ - <-i,j - - <J-i - <J+i) - htPmiki)) , (3) 

where I = 2'^ ^ hi = and Pm is the weighted average restriction to level 
m. 

The high occupancy multigrid (HOMG) algorithm we propose is quite sim- 
ple: (a) Tji relaxations-interpolations on the coarsest level V2My^ ^ R{r]i)v2Mf^ 
and Vh ^ Sv2My^] (b) 7^2 relaxations-interpolations on the next finer level 
V2(M-i)f^ ^ R{r]2)v2(M-i)h and Vh ^ Sv2(M-i)f^; (c) go back to a coarser level 
or descend to a finer level in a chosen pattern until reaching the finest level. 

We have explored many sequences of coarse to fine grid levels and found 
the most efficient to be the modified full multigrid cycle (MFMG) presented 
in Fig. [U with only two iterations in the coarsest level, increasingly doubhng 
them for every finer level until reaching a chosen bound MaxI. 

The restrictions are independent of the cycle and can be done in a pre- 
processing step using several independent arrays in memory. If memory savings 
are desirable, the restrictions can be done inside the cycle with just one extra 
memory array. The restrictions are performed inside the GPU using linear 
weighted averages organized as a series of nested partial reductions over the 
entire mesh. 

After an MFMG cycle is complete, the residual equation is used to restart 
the cycles as many times as needed to reach any desired accuracy. 

3 Power of two size grids 

Inherently to their architecture, GPUs perform optimally in grids with a power 
of two number of points per side while multigrid algorithms are usually limited 
to nested grids. We have designed a geometric strategy to allow the use of the 
HOMG in a two-dimensional grid with a power of two points per side. The 
procedure is illustrated in Fig. [2l 



Fig. 2 Geometric strategy for multigrid in power of two grids. The grid is divided into four 
subdomains with a power of two number of points per side. Each set of symbols represents the 
location of the coarse 2h grid in each subdomain. Some of them lie outside the subdomain. 
During a 2h relaxation-interpolation iteration, relaxations are performed over all the points 
with a symbol and interpolations on the rest. The interpolations avoid cross-domain data. 

The grid is divided into four subdomains with a power of two nodes per 
side. The coarse grid points are marked for each subdomain relative to the 
corresponding external corner. These marked points would form a unique set 
in a nested configuration but here they form four different sets. The coarse 
subsets need to include central points, if not, some coarse modes, like the 
zero mode, can not be resolved. Due to the absence of central points we mark 
the coarse grid points that lie just outside the corresponding subdomain. We 
perform relaxations in all the marked points and interpolations on the rest of 
the fine grid. The interpolations make use only of points in the same subdomain 
to avoid cross-domain interpolations. 

4 Results 

We test our algorithm solving the two-dimensional Poisson equation 

V\ = f (4) 

on the unit square i? = [0, 1] x [0,1] with Dirichlet boundary condition u = 
on the edge dO. 

We choose the polynomial analytic solution 

u{xi,X2) = -xlxl{l - x\){l - xl) (5) 

corresponding to the source 

f{xi,X2) = -2{xl{l - 6xl){l - xl) + xld - QxDd - xD). (6) 

This solution has a wide spectrum with components in all the modes of the 
grid. This characteristic is quite useful to study the convergence of multigrid 
algorithms like the HOMG, and has been previously used in [1 . 

Fig. [3] shows the sequential approximation of the solution with the HOMG 
algorithm in one MFMG cycle with a maximum of two iterations per level 
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Fig. 3 High occupancy multigrid (HOMG) approximation to the solution of Eq. with 
the polynomial source (|6} in a 128^ grid. The approximation is shown for different iterations 
of the modified full multigrid cycle (MFMG) with a maximum of two iterations per level 
{MaxI = 2). 

{MaxI = 2) and a 128^ grid. During the first iteration we see a spike because 
only the middle points perform a relaxation and the interpolations have null 
sources. Remember that the interpolations are delayed one iteration because 
they are performed in parallel together with the relaxations. The second it- 
eration shows that the relaxations have stalled while the interpolations form 
the zero mode approximation with pyramid shape. The third iteration shows 
clearly the next finer level relaxations-interpolations and so on. The MFMG 
cycle has a total of 31 iterations for a 128^ grid. 

The error of the approximations is computed using the normalized Li -error 



Fig.|4]shows the convergence of the Li-error for the HOMG method for one 
MFMG cycle and two different choices of the maximum iterations per cycle 
MaxI. The results show all the desirable properties of a multigrid algorithm. 
Cycles with the same MaxI descend over the same convergence path for dif- 
ferent resolutions. Higher resolutions continue the smoothing of the solution 
obtained with coarser grids. Increasing MaxI allows the cycles to reach lower 
errors with the penalty of a large number of iterations per cycle. MaxI = 4 
reaches an error of 5% and MaxI = 32 reaches an error of 1.6%. Operations 
over the entire grid are denoted as a working unit (WU). Given a desired ac- 
curacy, the algorithm performs a number of operations linearly proportional 
to the points in the grid. Remember that after an MFMG cycle is complete, 
the residual equation is used to restart the cycles as many times as needed to 
reach any desired accuracy. 

We explore the scalability of the HOMG algorithm on GPUs with different 
architectures and number of processors. The test consists of one MFMG cycle 
with MaxI = 32 and different resolutions. The GPUs used are all GeForce 
from Nvidia. We use C language for CUDA as the application programming 
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interface. Five GPUs of the modified 8-series architecture: a 310M with 16 
processors, a 9500GT with 32 processors, a 330M with 48 processors, a 9800GT 
with 112 processors and a GTX260 with 216 processors. And two GPUs of the 
GeForce Fermi architecture: a GTX460 with 336 processors and a GTX480 
with 480 processors. All the computations are done in single precision. 

The scalability results are shown in Fig. [5l The slopes are moderate for a 
low number of processors and small grids. For a large number of processors 
and large grids the algorithm reaches perfect scalability. 
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5 Conclusion 

In this article, we present a novel multigrid algorithm specially designed to 
work entirely inside many-core architectures like GPUs without memory trans- 
fers between the GPU and the CPU. The algorithm makes use of entire grid 
operations even for coarse grid corrections. Interpolations and relaxations are 
fused into one task giving useful work for every thread in the grid. In this way 
the algorithm has full multithreading and fix memory patterns, allowing the 
full exploitation of the fast memory models of the GPU, efficiently hiding cash 
misses and memory latencies. 

The algorithm is denoted as the high occupancy multigrid (HOMG) algo- 
rithm because multithreading and useful work per thread are kept constantly 
high. The algorithm is combined with a modified full multigrid cycle (MFMG) 
to reach a high efficiency. For a given accuracy, the operations of the HOMG 
scale linearly with the total number of nodes. Perfect scalability is observed 
for a large number of processors and large grids. 
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